Table of Contents


Add new variables with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

By default, mutate() adds new columns at the end of your dataset, and it doesn’t change the original data frame. If you want to keep the outcome of mutate(), or of any of the other functions we discussed in the previous notebook, assign the output to a new or existing object.

# Starting with a smaller number of columns
flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time
                     )

flights_sml

mutate(flights_sml,
       gain = dep_delay - arr_delay,
       speed = distance / air_time * 60
      )

Alternatively we could’ve used our piping skills:

flights %>%
  select(year:day, 
         ends_with("delay"), 
         distance, 
         air_time
        ) %>%
  mutate(gain = dep_delay - arr_delay,
         speed = distance / air_time * 60)

Note that you can refer to columns that you’ve just created:

mutate(flights_sml,
       gain = dep_delay - arr_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
      )
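To illustrate the earlier point that mutate() leaves its input untouched, here is a minimal sketch on a hypothetical two-row data frame (`toy` and `toy_gain` are made-up names, not part of the flights data). It only shows the difference between calling mutate() and assigning its result.

```r
library(dplyr)

# Hypothetical toy data standing in for flights_sml
toy <- data.frame(dep_delay = c(10, -5), arr_delay = c(4, 0))

# Calling mutate() without assignment prints a result but leaves `toy` as it was
mutate(toy, gain = dep_delay - arr_delay)
names(toy)       # still "dep_delay" "arr_delay"

# Assigning the output keeps the new column
toy_gain <- mutate(toy, gain = dep_delay - arr_delay)
names(toy_gain)  # "dep_delay" "arr_delay" "gain"
toy_gain$gain    # 6 -5
```

Assigning back to the same name (`toy <- mutate(toy, ...)`) would overwrite the original instead of creating a second object.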

Grouped summaries with summarize()

The last key verb is summarize() (or summarise()). It collapses a data frame to a single row:

# Average delay
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

What happens if we don’t specify na.rm = TRUE?
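One way to explore this question is with a small hypothetical vector rather than the full flights data. The sketch below assumes nothing beyond base R; `delays` is a made-up example.

```r
# Hypothetical delays with one missing value
delays <- c(5, NA, 15)

mean(delays)                # NA: a single missing value makes the mean missing
mean(delays, na.rm = TRUE)  # 10: missing values are dropped before averaging
```

The same behavior carries over when mean() is used inside summarize().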

group_by()

summarize() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame, they’ll automatically be applied “by group”. For example, if we apply exactly the same code to a data frame grouped by date, we get the average delay per date:

# Average delay for each day
flights %>%
  group_by(year, month, day) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE))
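To make the “by group” behavior easy to verify by hand, here is a small sketch on a hypothetical four-row data frame (`toy` and `by_day` are made-up names). It contrasts the ungrouped and grouped summaries using the same summarize() call.

```r
library(dplyr)

# Hypothetical toy data: two days, one missing delay
toy <- data.frame(
  day       = c(1, 1, 2, 2),
  dep_delay = c(10, 20, NA, 6)
)

# Without grouping: one row for the whole data frame
overall <- summarize(toy, delay = mean(dep_delay, na.rm = TRUE))
overall$delay  # 12

# With grouping: one row per day
by_day <- toy %>%
  group_by(day) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE))
by_day$delay   # 15 6
```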

Examples

How does the average distance to a destination relate to its average arrival delay?

delay <- flights %>% 
  group_by(dest) %>%  # group by destination
  summarize(count = n(),  # counting the number of flights (per destination)
            Distance = mean(distance, na.rm = TRUE),  # average distance (per destination)
            Delay = mean(arr_delay, na.rm = TRUE)  # average delay (per destination)
           ) %>%
  filter(count > 20, dest != "HNL")  # keeping destinations with more than 20 flights that are not "HNL"

delay

p <- ggplot(data = delay, mapping = aes(x = Distance, y = Delay)) +
  geom_point(aes(size = count, color = dest), alpha = 1/3) +
  geom_smooth(se = FALSE) +
  labs(title = "Average distance to destination vs. arrival delay",
      caption = "The circle size shows the number of flights to that destination.")

ggplotly(p + theme(legend.position = "none"))  # Same plot with ggplotly()

It looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there’s more ability to make up delays in the air?

Can you figure out why we removed “HNL” (Pacific/Honolulu)?


Let’s examine whether the number of flights increases during the summer:

flight_count <- flights %>%
  mutate(date = make_datetime(year, month, day)) %>%
  group_by(date) %>%
  summarize(count = n(),  # counting the number of flights (per day)
            delay = mean(arr_delay, na.rm = TRUE)  # average delay (per day)
           )

p <- ggplot(flight_count, aes(date, count)) +
  geom_line() +
  labs(title = "Number of flights out of NYC") +
  theme_classic()

ggplotly(p)

Interestingly, all of these dips fall on weekends. Let’s check this by creating a bar chart of the number of flights per day of the week:

flights %>%
  mutate(date = make_datetime(year, month, day),
        weekday = wday(date, label = TRUE)) %>%
  group_by(weekday) %>%
  summarize(count = n(),  # counting the number of flights (per weekday)
            delay = mean(arr_delay, na.rm = TRUE)  # average delay (per weekday)
           ) %>% 
  ggplot(aes(weekday, count)) + 
    geom_bar(stat="identity", width = 0.5) + 
    labs(title="Number of flights per day of week",
        x = "Day of Week",
        y = "Count") +
    theme_classic() +
    theme(axis.text.x = element_text(angle=65, vjust=0.6))
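The weekday column above comes from lubridate’s wday(). As a quick sanity check, the sketch below applies it to a single date (5 January 2013, a Saturday); the variable `d` is a made-up name, and `week_start = 7` is passed explicitly so that Sunday counts as day 1 regardless of local options.

```r
library(lubridate)

d <- make_datetime(2013, 1, 5)  # 5 January 2013, a Saturday

wday(d, week_start = 7)  # 7: with Sunday as day 1, Saturday is day 7
wday(d, label = TRUE)    # ordered factor; the label text depends on your locale
```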
